METHOD AND APPARATUS FOR IMPROVING THE PERFORMANCE OF
A FLOATING POINT MULTIPLIER ACCUMULATOR 

BACKGROUND 

Field of the Invention 

The present invention relates to the field of computer processors and 
floating point mathematical support in computer processors. More 
specifically, this invention relates to improving the performance of a floating 
point multiplier accumulator.

Background 

Computers are ubiquitous in modern society. Computers are regularly
used for the complex mathematical rendering demanded by modern computer
graphics, as well as for traditional accounting, architecture and other
specialized, mathematically intensive application programs. Mathematical
computations which require very large numbers, require high precision,
and/or include complex mathematical equations are referred to as floating
point calculations. When programming software, floating point numbers are
used when performing floating point calculations. Floating point numbers
are commonly defined as having three parts: a sign, a significand (also known
as a mantissa), and an exponent. A well known standard that sets a
framework for how floating point numbers and calculations should be
implemented is I.E.E.E. Standard 754 (1985, reaffirmed 1990), the Standard for
Binary Floating Point Arithmetic, available from the Institute of Electrical
and Electronics Engineers, Inc., 445 Hoes Lane, Piscataway, New Jersey
08855-1331 (the I.E.E.E. Floating Point Standard).
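
As a minimal sketch of the three-part layout described above, the following Python fragment splits a 64-bit binary floating point value into its sign, exponent, and significand fields. The helper name `decompose_double` is hypothetical, and the field widths (1, 11, and 52 bits) are those of the double format of the I.E.E.E. Floating Point Standard rather than anything specific to this disclosure.

```python
import struct

def decompose_double(x: float) -> tuple[int, int, int]:
    """Return (sign, biased_exponent, fraction) for a 64-bit float."""
    # Reinterpret the 8 bytes of the float as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # 1 sign bit
    biased_exponent = (bits >> 52) & 0x7FF  # 11 exponent bits, bias of 1023
    fraction = bits & ((1 << 52) - 1)       # 52 stored significand bits
    return sign, biased_exponent, fraction

# -6.5 is -1.625 x 2^2, so the exponent field holds 2 + 1023.
print(decompose_double(-6.5))
```

The stored fraction omits the leading 1 of a normalized significand, which is why 1.0 decomposes to a fraction field of zero.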

Floating point support has been implemented in a number of ways
with processors. In earlier personal computers, a floating point co-processor
was optionally available to be installed with and to assist a processor in
handling floating point calculations (e.g., Intel Corporation provided a
Numeric Processor Extension chip named the 8087 to accompany the widely
used 8086 processor). As personal computers have evolved, processors have
incorporated floating point capability within a processor by including one or
more floating point units in a processor.

Traditionally, only specialized scientific and accounting application
programs accessed a processor's floating point capabilities. However, today,
colorful graphic and multimedia images are in widespread use in, for
example, internet web pages, architectural software applications, computer
games, and animation creation programs. Further, the use of Digital Video
Disks (DVD) and the impending on-demand download of video presentations
such as movies will cause increased usage of floating point capabilities of
processors in computers and more specialized viewing devices. In all of
these uses, images are stored in various compressed or encoded formats. The
more detailed and higher resolution an image is, the more floating point
calculations are needed to process (i.e., decompress or decode) and render the
image for display on a monitor or other image generating device. As the use
of graphic images has become popular and continues to grow, the use of a
processor's floating point mathematical capabilities has been increasing. In
addition, many other computer and processor uses, including use for audio
processing, are also contributing to an increased use of a processor's floating
point mathematical capabilities. To accommodate these and other needs, and
to meet the ever growing demand for increased floating point performance,
the floating point capability of processors is continually evolving. Any
incremental increase in floating point throughput will increase the
throughput of processors, computers, viewing devices, and any other systems
utilizing the floating point capabilities of a processor.

BRIEF DESCRIPTION OF THE DRAWINGS 
Figure 1 depicts a prior art floating point multiplier accumulator. 
Figure 2 depicts the flow of actions taken according to a prior art 
method of implementing a floating point multiplier accumulator. 
Figure 3 depicts one embodiment of a floating point multiplier
accumulator according to the present invention.
Figure 4 depicts the flow of actions taken according to the method of
implementing a floating point multiplier accumulator according to the
present invention.

DETAILED DESCRIPTION

Floating point numbers are implemented in a limited bit space, often
in 32, 64, and 128 bit widths. Whenever computations are performed on
floating point numbers, the bit width of the significand may be exceeded. For
ease of reference, the bits beyond the designated significand bit width are
described herein as "fallout" bits. Depending on the selected rounding mode
and the fallout bits, the result of the computation may have to be modified by
rounding. The discussion and invention herein involve the processing and
computation of the significand portion of floating point numbers. For ease of
reference, the term floating point number is used even though the significand
is what is being acted on.
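
To make the notion of fallout bits concrete, the following sketch separates a result that is wider than the significand into the bits that fit and the bits that fall out. The function name `split_fallout` and the 4-bit significand width are illustrative assumptions chosen to keep the numbers small.

```python
def split_fallout(value: int, total_bits: int, keep_bits: int) -> tuple[int, int]:
    """Split a total_bits-wide value into (kept, fallout) bit fields."""
    drop = total_bits - keep_bits
    kept = value >> drop                 # upper bits that fit the significand
    fallout = value & ((1 << drop) - 1)  # lower bits beyond the significand width
    return kept, fallout

# Two 4-bit significands multiply to an 8-bit product: 11 x 13 = 143.
product = 0b1011 * 0b1101
kept, fallout = split_fallout(product, total_bits=8, keep_bits=4)
print(bin(kept), bin(fallout))
```

Here the kept field is 0b1000 and the fallout field is 0b1111, so whether the kept field must be adjusted depends on the rounding mode in effect.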

A floating point multiplier accumulator (FMAC) by definition receives
as input three floating point numbers, A, B and C, and produces (A x B + C)
as a result. In traditional, prior art systems, an FMAC processes (A x B + C) to
create what is known as SUM and CARRY, adds SUM and CARRY, shifts or
normalizes the result, and then performs rounding. According to the present
invention, to increase performance of an FMAC, when the SUM and CARRY
are added together, the resulting sum, the resulting sum plus one, and the
resulting sum plus two are computed in parallel so that the value necessitated
by any needed rounding is computed in advance of the traditional rounding
step. According to the present invention, the appropriate result is then
chosen according to the rounding mode and normalized. In one
embodiment, this method reduces the number of clock cycles needed for an
FMAC to complete its execution by one. That is, in one embodiment, one
clock cycle is saved by the use of the apparatus and method of the present
invention. Although one clock cycle alone is a small amount of time, with
the ever increasing use of floating point calculations and concomitant
reliance on the floating point capabilities of processors, a nontrivial increase
in overall processor performance results. In addition, only a minimal
increase of on-chip hardware is required to accomplish the method and
achieve the performance improvement.
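
The description does not detail how the multiplier arrives at SUM and CARRY. One common way to obtain such a pair is carry-save reduction, and the sketch below assumes that technique for illustration only. The function names are hypothetical, and the operands are treated as non-negative integers standing in for significands.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three addends to a (sum, carry) pair without propagating carries."""
    s = x ^ y ^ z                            # per-bit sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carries, moved up one place
    return s, c

def multiply_add_carry_save(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (SUM, CARRY) such that SUM + CARRY == a * b + c."""
    sum_word, carry_word = c, 0
    for bit in range(a.bit_length()):
        if (a >> bit) & 1:
            # Fold in one shifted partial product of a * b.
            sum_word, carry_word = carry_save_add(sum_word, carry_word, b << bit)
    return sum_word, carry_word

sum_word, carry_word = multiply_add_carry_save(11, 13, 7)
print(sum_word, carry_word, sum_word + carry_word)
```

The pair is a redundant representation: neither word alone is the answer, and a final carry-propagating addition of SUM and CARRY is still required to obtain A x B + C.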

Figure 1 depicts a prior art floating point multiplier accumulator.
Floating point multiplier accumulator (FMAC) 100 receives as input floating
point numbers A, B, and C. The FMAC outputs the properly rounded result
of (A x B + C) based on the rounding mode. Multiplier 102 receives A, B
and C and outputs what is known as SUM and CARRY. Propagate, kill,
generate (PKG) generator 104 is coupled to multiplier 102 and receives SUM
and CARRY. Using SUM and CARRY, the PKG generator produces (1) P, the
logical sum of A and B, (2) G, the logical product of A and B, and (3) K, the
product of the one's complement of A and the one's complement of B, where A
and B in these expressions denote the two operands being added (here, SUM
and CARRY). That is:

P = A + B

G = A x B

K = complement(A) x complement(B)

Adder 106 is coupled to the PKG generator and receives as input P, K
and G and uses P, K and G to determine the sum of SUM and CARRY.
Leading zero anticipator (LZA) 108 is also coupled to the PKG generator and
receives P, K and G from the PKG generator. Normalization shifter 110
receives as input the result from adder 106 and the position of the decimal
point from leading zero anticipator 108. Rounding unit 112 receives the
normalized result from normalization shifter 110, and then increments,
decrements or leaves unaffected the normalized result, depending on the
rounding mode. The rounding mode is determined by, in one embodiment,
reading the contents of a well-known register in the processor. The rounding
mode is one of the rounding modes provided for in the I.E.E.E. Floating Point
Standard. Pursuant to this standard, the rounding mode may be round
toward positive infinity, round toward negative infinity, round toward zero,
or round toward nearest. In one prior art embodiment, rounding in the
form of incrementing or decrementing requires at least one clock cycle. In
this prior art implementation, the total time for the FMAC processing
includes the time needed to sequentially perform SUM plus CARRY addition
in adder 106 and then perform rounding in rounding unit 112, if needed.
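
The propagate, kill, and generate relations given above can be sketched in software as bitwise operations on the two addends. The fragment below is an illustration under stated assumptions: the function names and the `width` parameter are hypothetical, and the carries are resolved one bit at a time for readability, whereas a hardware adder would typically resolve them with faster lookahead logic.

```python
def pkg(sum_word: int, carry_word: int, width: int) -> tuple[int, int, int]:
    """Bitwise propagate, kill, and generate words for two addends."""
    mask = (1 << width) - 1
    p = (sum_word | carry_word) & mask    # P = A + B (logical OR)
    g = (sum_word & carry_word) & mask    # G = A x B (logical AND)
    k = (~sum_word & ~carry_word) & mask  # K = complement(A) x complement(B)
    return p, k, g

def add_with_pkg(sum_word: int, carry_word: int, width: int) -> int:
    """Add two words modulo 2**width, deriving every carry from P, K, and G."""
    p, k, g = pkg(sum_word, carry_word, width)
    result, carry = 0, 0
    for i in range(width):
        p_i, k_i, g_i = (p >> i) & 1, (k >> i) & 1, (g >> i) & 1
        half_sum = p_i & (1 - g_i)         # 1 when exactly one addend bit is set
        result |= (half_sum ^ carry) << i  # sum bit uses the incoming carry
        if k_i:
            carry = 0                      # both bits 0: incoming carry is killed
        elif g_i:
            carry = 1                      # both bits 1: a carry is generated
        # otherwise exactly one bit is 1 and the incoming carry propagates
    return result

print(pkg(0b1010, 0b0110, 4), add_with_pkg(0b1011, 0b0110, 8))
```

For every bit position exactly one of kill, generate, or propagate-only applies, which is what lets the carry into any position be determined from P, K, and G alone.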

Figure 2 depicts the flow of actions taken according to a prior art 
method of implementing a floating point multiplier accumulator. The prior 
art FMAC receives three floating point numbers designated as A, B and C as 
input, as shown in block 200. A x B + C is then computed in SUM and CARRY 
form, as shown in block 204. P, K and G are then generated using SUM and 
CARRY, as shown in block 208. SUM and CARRY are then added using P, K, 
and G, as shown in block 210. The position of where the decimal point is 
located is then determined by the leading zero anticipator, as shown in block 
212. The result, A x B + C, is then shifted according to the result of the leading
zero determination, as shown in block 220. The shifted result is then
incremented or decremented, if needed, according to the rounding mode, as
shown in block 224. In this prior art method, the adding of SUM and CARRY
and the determination of the leading zero may be performed in parallel. As
mentioned above, in this prior art method, the total time for FMAC 
processing includes the time required to sequentially perform multiplication, 
generate P, K and G, perform addition, normalize and to perform rounding, if 
needed. 
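
The shifting step of blocks 212 and 220 can be illustrated with a short sketch. As a simplifying assumption, the leading zero count below is taken directly from the completed sum, whereas a leading zero anticipator predicts it from P, K and G while the addition is still in progress. The function names and the fixed `width` are hypothetical.

```python
def leading_zero_count(value: int, width: int) -> int:
    """Number of zero bits above the most significant 1 in a width-bit word."""
    return width - value.bit_length()

def normalize(value: int, width: int) -> tuple[int, int]:
    """Shift left until the most significant 1 reaches the top bit position."""
    shift = leading_zero_count(value, width)
    return (value << shift) & ((1 << width) - 1), shift

print(normalize(0b00010110, 8))
```

The shift amount is returned alongside the shifted word because the exponent of the result would have to be adjusted by the same amount.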

Figure 3 depicts one embodiment of a floating point multiplier
accumulator according to the present invention. Floating point multiplier
accumulator (FMAC) 300 receives as input floating point numbers A, B, and
C. Multiplier 304 receives A, B and C, produces A x B + C and outputs the
result in SUM and CARRY form. Propagate, kill, generate (PKG) generator
308 is coupled to multiplier 304 and receives the SUM and CARRY as input.
The PKG generator produces P, K, and G using SUM and CARRY, and may be
the same PKG generator as in the prior art. Adder 310, plus-oner 312,
plus-two-er 314, and leading zero anticipator 316 are coupled to PKG
generator 308 and all receive P, K, and G as input. Adder 310 uses P, K and G
to add SUM and CARRY. Plus-oner 312 uses P, K and G to add SUM and CARRY
and increment the resulting sum by one. Plus-two-er 314 uses P, K and G to add
SUM and CARRY and increment the resulting sum by two. Leading zero
anticipator (LZA) 316 determines the location of the decimal point of the
result by computing a leading zero position. Multiplexor 320 receives the
result of each of adder 310, plus-oner 312, and plus-two-er 314 and selects
which of the results is appropriate responsive to a control signal received
from rounding control 322. Rounding control 322 issues an appropriate
signal responsive to the rounding mode and the output of leading zero
anticipator 316. The rounding mode may be obtained from, in one
embodiment, a register in the processor. Normalization shifter 326 receives
as input the appropriate sum selected by multiplexor 320. According to this
method, the prior art steps of adding followed by rounding are effectively
achieved in parallel by the computations accomplished by adder 310,
plus-oner 312, plus-two-er 314, multiplexor 320, and rounding control 322.
That is, the prior art rounding hardware is replaced with multiplexor 320 and
rounding control 322, which complete their execution in less time than
traditional rounding, and thus the resulting FMAC requires less time to
complete its computation. This results in a relatively small increase in
hardware required on the processor while improving floating point
computation throughput by saving, in one embodiment, one clock cycle.
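
The arrangement of adder 310, plus-oner 312, plus-two-er 314, multiplexor 320, and rounding control 322 can be sketched as follows. The description does not state how rounding control 322 maps the rounding mode and the LZA output to a select signal, so the rule used here (increment at bit 0 or at bit 1, depending on where the least significant bit of the final result falls) is a purely hypothetical policy, as are the names `rounding_control`, `fmac_select`, `round_up`, and `lsb_position`. Software also evaluates the three candidates one after another, whereas the hardware produces them in parallel.

```python
def rounding_control(round_up: bool, lsb_position: int) -> int:
    """Hypothetical select signal: 0 = sum, 1 = sum plus one, 2 = sum plus two."""
    if not round_up:
        return 0
    return 1 if lsb_position == 0 else 2

def fmac_select(sum_word: int, carry_word: int, round_up: bool,
                lsb_position: int, width: int) -> int:
    """Form all three candidate results, then pick one as a multiplexor would."""
    mask = (1 << width) - 1
    base = (sum_word + carry_word) & mask
    candidates = (
        base,               # adder
        (base + 1) & mask,  # plus-oner
        (base + 2) & mask,  # plus-two-er
    )
    return candidates[rounding_control(round_up, lsb_position)]

print(fmac_select(5, 3, False, 0, 8),
      fmac_select(5, 3, True, 0, 8),
      fmac_select(5, 3, True, 1, 8))
```

Because every candidate already exists when the select signal arrives, the only work left after the addition is the selection itself, which is the source of the time saving described above.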

Figure 4 depicts the flow of actions taken according to the method of
implementing a floating point multiplier accumulator according to the
present invention. The FMAC receives three floating point numbers
designated as A, B and C as input, as shown in block 400. A x B + C is then
computed in SUM and CARRY form, as shown in block 404. P, K and G are
then generated, as shown in block 408. The PKG generator used in block 408
may be the same as the PKG generator used in prior art systems. Four
operations then occur in
parallel: (1) SUM and CARRY are added together using P, K and G, as shown
in block 410; (2) SUM and CARRY are added together using P, K and G and
the resulting sum is then incremented by one, as shown in block 412; (3) SUM
and CARRY are added together using P, K and G and the resulting sum is
then incremented by two, as shown in block 414; and (4) the location of the
decimal point is determined from the position of the leading zeros, as
shown in block 416. The appropriate result of blocks 410, 412, and 414 is then
selected according to the rounding mode and the position of the leading zero,
as shown in block 420. In one embodiment, the rounding mode may be
determined by examination of a register in the processor in which the FMAC
resides. The result is then normalized according to the outcome of the leading
zero determination, as shown in block 430. By determining the result of the
addition, the addition with incrementing by one, and the addition with
incrementing by two, all of the possible results responsive to the rounding
mode are predetermined. To appropriately round the result, the appropriate
value produced by blocks 410, 412 and 414 in parallel is selected, eliminating
the time consuming two step adding and rounding sequence taught in the
prior art. As rounding is often necessary, the selection of the result
responsive to the rounding mode saves time and increases throughput,
particularly when performing numerous floating point computations.
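
The selection of block 420 depends on the rounding mode. One way the four rounding modes of the I.E.E.E. Floating Point Standard could be reduced to a yes-or-no increment decision is sketched below. The sketch assumes a sign-magnitude significand and resolves ties in round toward nearest to the even value, as that standard specifies; the names `RoundingMode` and `needs_increment` and the parameter layout are hypothetical and are not taken from this description.

```python
from enum import Enum

class RoundingMode(Enum):
    TOWARD_POSITIVE = "round toward positive infinity"
    TOWARD_NEGATIVE = "round toward negative infinity"
    TOWARD_ZERO = "round toward zero"
    TO_NEAREST = "round toward nearest"

def needs_increment(mode: RoundingMode, negative: bool, lsb: int,
                    fallout: int, fallout_bits: int) -> bool:
    """Whether the kept magnitude must grow by one unit in its last place."""
    if fallout == 0:
        return False                  # result is exact, nothing to round
    if mode is RoundingMode.TOWARD_ZERO:
        return False                  # always truncate the magnitude
    if mode is RoundingMode.TOWARD_POSITIVE:
        return not negative           # only positive values move upward
    if mode is RoundingMode.TOWARD_NEGATIVE:
        return negative               # only negative values grow in magnitude
    half = 1 << (fallout_bits - 1)    # fallout value that is exactly halfway
    if fallout != half:
        return fallout > half
    return lsb == 1                   # tie: round to the even neighbor

print(needs_increment(RoundingMode.TO_NEAREST, False, 1, 0b100, 3))
```

A decision of this kind, combined with the leading zero position, is the sort of information from which a select signal for the precomputed results could be derived.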

Although the method and apparatus described above are resident in a
processor, the method may also be implemented in software and microcode
in those processors that allow and provide for such an implementation. Such
software may reside within a processor, in cache memory, in random access
memory, etc. In addition, such software may be read by the processor during
boot up as part of a basic input output system (BIOS) or similar startup
sequence and may be read from a hard disk, floppy disk, stick memory device,
programmable read only memory (PROM), flash memory, or any other kind
of machine readable medium.

In the foregoing specification, the invention has been described with
reference to specific embodiments thereof. It will, however, be evident that
various modifications and changes can be made thereto without departing
from the broader spirit and scope of the invention as set forth in the
appended claims. The specification and drawings are, accordingly, to be
regarded in an illustrative rather than a restrictive sense. Therefore, the scope
of the invention should be limited only by the appended claims.

CLAIMS



What is claimed is: 

1. A significand portion of a floating point multiply accumulator
(FMAC) comprising:
a multiplier receiving a first input significand, a second input significand,
and a third input significand;
a propagate, kill, generate generator (PKG generator) coupled to the
multiplier;
an adder, a plus-oner, a plus-two-er and a leading zero anticipator (LZA)
each coupled to the PKG generator;
a rounding control unit coupled to the LZA;
a multiplexor coupled to each of the adder, the plus-oner, the plus-two-er
and the rounding control unit; and
a normalization shifter coupled to the multiplexor and the LZA.

2. The significand portion of Claim 1 wherein the multiplier outputs a
sum value and a carry value.

3. The significand portion of Claim 2 wherein the PKG generator
computes a propagate value (P), a kill value (K) and a generate value (G) based
on the sum value and the carry value.

4. The significand portion of Claim 3 wherein in parallel the adder
adds the sum value and the carry value using P, K, and G; the plus-oner adds the
sum value and the carry value using P, K, and G and increments by one; and the
plus-two-er adds the sum value and the carry value using P, K, and G and
increments by two.

5. The significand portion of Claim 4 wherein the LZA computes in
parallel with the adder, the plus-oner, and the plus-two-er.

6. The significand portion of Claim 1 wherein the rounding control
unit reads a rounding mode from a register in a processor in which the FMAC
resides.

7. The significand portion of Claim 1 wherein the normalization
shifter and the rounding control unit each receive a leading zero position
indication from the LZA.

8. The significand portion of Claim 1 wherein the multiplexor
produces an output result responsive to the rounding control unit.

9. A floating point multiply accumulator (FMAC) comprising:
a multiplier;
a propagate, kill, generate generator (PKG generator) to produce a
propagate value (P), a kill value (K) and a generate value (G) coupled to the
multiplier;
an adder, a plus-oner, a plus-two-er and a leading zero anticipator (LZA)
each coupled to the PKG generator in parallel;
a rounding control unit coupled to the LZA and coupled to a multiplexor,
the multiplexor outputting a result from one of the adder, the plus-oner, and the
plus-two-er responsive to the rounding control unit; and
a normalization shifter coupled to the multiplexor and the LZA.

10. The FMAC of Claim 9 wherein the multiplier produces a product of
a first floating point number and a second floating point number added to a third
floating point number as a sum value and a carry value.

11. The FMAC of Claim 10 wherein in parallel the adder adds the sum
value and the carry value using P, K, and G; the plus-oner adds the sum value
and the carry value using P, K, and G and increments by one; and the plus-two-er
adds the sum value and the carry value using P, K, and G and increments by two.

12. The FMAC of Claim 9 wherein the rounding control unit outputs a
select signal to the multiplexor based on a rounding mode and the decimal point
position.

13. The FMAC of Claim 9 wherein the normalization shifter
normalizes based on the decimal point position.

14. A floating point multiply accumulator (FMAC) comprising:
a means for multiplying a first significand and a second significand and
adding a third significand to produce a sum value and a carry value;
a means for computing a propagate value, a kill value, and a generate
value coupled to the means for multiplying;
a first means for adding the sum value to the carry value;
a second means for adding the sum value to the carry value and
incrementing by one;
a third means for adding the sum value to the carry value and
incrementing by two;
a means for determining a leading zero position, such that the first means
for adding, the second means for adding, the third means for adding, and the
means for determining are coupled in parallel to the means for computing;
a means for controlling responsive to the means for determining and a
rounding mode, the means for controlling further coupled to a means for
selecting, the means for selecting outputting a result from one of the first means
for adding, the second means for adding, and the third means for adding
responsive to the means for controlling; and
a means for normalizing coupled to the means for selecting and the
means for determining.

15. The FMAC of Claim 14 wherein the means for controlling reads the
rounding mode from a register in a processor in which the FMAC resides.

16. The FMAC of Claim 14 wherein the means for normalizing is
responsive to the means for determining.

17. A method in a floating point multiply accumulator (FMAC)
comprising:
receiving a first floating point number, a second floating point number
and a third floating point number;
computing a product of the first floating point number and the second
floating point number and adding a third floating point number to produce a
sum value and a carry value;
computing a propagate value, a kill value and a generate value based on
the sum value and the carry value;
simultaneously adding the sum value to the carry value to create a first
result, adding the sum value to the carry value and incrementing by one to
create a second result, adding the sum value to the carry value and incrementing
by two to create a third result, and determining a decimal point position;
selecting one of the first result, the second result and the third result
responsive to a rounding mode and the decimal point position as a selected
result; and
normalizing the selected result based on the decimal point position.

18. The method of Claim 17 further comprising:
reading the rounding mode from a register in a processor in which the
FMAC resides.

19. The method of Claim 17 wherein normalizing comprises:
shifting the bits in the selected result.

20. The method of Claim 17 wherein the propagate value, the kill
value and the generate value are used by the adder, the plus-oner and the
plus-two-er to compute the first result, the second result and the third result.

21. A machine readable medium containing instructions which, when
executed by a processor, cause a machine to perform operations comprising:
receiving a first floating point number, a second floating point number
and a third floating point number;
computing a product of the first floating point number and the second
floating point number and adding a third floating point number to produce a
sum value and a carry value;
computing a propagate value, a kill value and a generate value based on
the sum value and the carry value;
simultaneously adding the sum value to the carry value to create a first
result, adding the sum value to the carry value and incrementing by one to
create a second result, adding the sum value to the carry value and incrementing
by two to create a third result, and determining a decimal point position;
selecting one of the first result, the second result and the third result
responsive to a rounding mode and the decimal point position as a selected
result; and
normalizing the selected result based on the decimal point position.

22. The machine readable medium of Claim 21 containing instructions
which, when executed by a processor, cause the machine to perform further
operations comprising:
reading the rounding mode from a register in a processor in which the
FMAC resides.

23. The machine readable medium of Claim 21 wherein normalizing
comprises:
shifting the bits in the selected result.

ABSTRACT

A method and apparatus to increase the performance of a floating point 
multiplier accumulator (FMAC). The method comprises receiving three floating 
point numbers and computing a product of the first floating point number and 
the second floating point number and adding a third floating point number to 
produce a sum value and a carry value. A propagate value, a kill value and a 
generate value are then computed based on the sum value and the carry value. 
Simultaneously the sum value is added to the carry value to create a first result, 
the sum value is added to the carry value and incremented by one to create a 
second result, the sum value is added to the carry value and incremented by two 
to create a third result, and a decimal point position is determined. One of the 
first result, the second result and the third result is then selected responsive to a 
rounding mode and the decimal point position. The selected result is 
normalized based on the decimal point position. The apparatus comprises a 
multiplier with a propagate, kill, generate generator (PKG generator) coupled to 
it. An adder, a plus-oner, a plus-two-er and a leading zero anticipator (LZA) are 
each coupled to the PKG generator in parallel. A rounding control unit is 
coupled to the LZA and coupled to a multiplexor that outputs a result from one 
of the adder, the plus-oner, and the plus-two-er responsive to the rounding 
control unit. A normalization shifter is coupled to the multiplexor and the LZA. 